ABSTRACT

Evaluations in education often throw away important
information because of a penchant for averages. Multiple regression
techniques are used to estimate the average effect of policies across
schools, and usually school performance is represented by the average
score of its students on an achievement test. The author suggests
some ways of broadening educational evaluations: (1) to consider
"outliers," or exceptional performers among schools; and especially,
(2) to consider other statistics of a school's distribution of scores
besides the mean, which have an intuitive link to ill-defined but
still meaningful educational objectives like equality, mobility,
success with exceptional children, and attainment of certain minimum
levels of skills. Tables reflecting the author's research are
included. (Author/ST!)
Evaluations in education, as elsewhere, often throw away important
information because of a penchant for averages. Multiple regression
techniques often are used to estimate the average effect of policies
across schools. And usually the statistic of school performance is
the average score of its students, say on an achievement test. In this
paper I suggest some ways of broadening educational evaluations: first,
to consider "outliers," or exceptional performers among schools; and
second, to consider other statistics of a school's distribution of
scores besides the mean, which have an intuitive link to ill-defined
but still meaningful educational objectives like equality, mobility,
success with exceptional children, and attainment of certain minimum
levels of skills. Although I confine my remarks here to the domain of
education, many apply to other policy areas as well.

Averages are pleasant to work with, being easily computable and
often effective estimators of the central tendency of a distribution.
In evaluations of public education, researchers have been disheartened
to learn that the average effect of variations in school policies on
school average scores is not consistently and importantly large, once

This article is based on two larger studies: A Statistical Search
for Unusually Effective Schools (with George R. Hall), The Rand
Corporation, R-1210-CC/RC, March 1973, and Achievement Scores and
Educational Objectives, The Rand Corporation, R-1217-NIE, January 1974.
I am grateful to the Carnegie Corporation, The National Institute of
Education, and The Rand Corporation for support; to George Hall for
inspiration; and to Franklin Berger, Theodore Donaldson, Gus Haggstrom,
Richard Light, and Richard Zeckhauser for advice and assistance. The
usual caveat protecting these individuals and institutions from further
responsibility is, of course, in order.

The unit of analysis might not be schools but districts, programs,
counties, and so forth. For simplicity I shall assume in what follows 
that the relevant unit is the school. 



the students' socioeconomic characteristics are held constant. As a
result of these disappointing findings, educators have lashed out,
alternatively or simultaneously, at achievement test measures, at
public schools in general, at insufficient levels of funding or at too
much funding.

But perhaps their ire should first be directed at the evaluators'
penchant for averages. Even if, on average, school policies do not
seem to greatly affect measurable student performance, might there not
be some schools that are exceptions to the insignificant regression
coefficients? And even if policies do not affect the schools' average
achievement scores, might they not affect the intraschool distribution
of scores--or the scores of some subset of students--in interesting and
important ways?

These questions have significant policy implications. If unusually
effective schools can be identified, even if they are rare, there is
hope that their superior performance can be replicated elsewhere in the
educational system. (And if no exceptional schools exist, we may have
to consider alternatives radically different from current dissemination
and diffusion policies--even to consider substantial changes in educational
expenditures, or overhauling the entire system.) If alternative policies
turn out to affect the spread of a school's scores, or perhaps the scores
of gifted or retarded children, even if such effects "wash out" when we
look at average scores, they may be very important for policy.

Searching for Unusually Effective Schools 

Suppose one looks at school mean scores and asks whether, after con- 
trolling for nonschool factors (like socioeconomic status, geographical 
variables, and so forth), some schools consistently have much higher 
"value-added" than others. That is, are some schools consistently above 
the regression line that relates the nonschool factors to achievement 
scores? Might they be called unusually effective schools? 

To find out, I looked at four large sets of achievement data,
including Michigan (1969-1971, grades 4 and 7), New York City (1967-1971,
grades 2 through 6), Project Talent (1960, grades 9 and 12, national



Averch et al. (1972); Jencks et al. (1972); Mosteller and
Moynihan (1972).
sample), and New York State districts ([illegible]-1971, grades 3 and 6).
Since there is no accepted model of the school policy variables that
should be included to capture the schools' true effect, and since
previous studies have shown that most interschool variation in mean
scores is explained by variation in nonschool factors, the study
controlled only for nonschool factors and assumed that all residual
variation represented the school effects (and random fluctuation). The
study was exploratory, aimed at finding exceptional schools if they
existed; therefore, there was liberal experimentation with simple and
complicated controls, using different kinds of data and different kinds
of fits. If unusual schools were located, one could not definitely say
whether their performance was due to school policies or not; but if no
consistent overachievers were found, the result would be strong indeed.
In effect this study attempted to estimate an upper limit on the probable
number and magnitude of exceptional schools.
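
The logic of this search can be sketched in a few lines. What follows is
an illustration only, not the study's own procedure: it assumes a single
nonschool factor (a hypothetical socioeconomic index called ses), a
straight-line least-squares fit, invented scores for five schools, and a
hypothetical cutoff expressed in raw score points. A school is flagged
only if it lies above the regression line by more than the cutoff in
every year examined.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares line for one predictor; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(x, y):
    """Actual score minus the score expected from the nonschool factor."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def consistent_overachievers(ses, scores_by_year, cutoff):
    """Indices of schools whose residual exceeds `cutoff` in every year."""
    flagged = None
    for scores in scores_by_year:
        above = {i for i, r in enumerate(residuals(ses, scores)) if r > cutoff}
        flagged = above if flagged is None else flagged & above
    return sorted(flagged or [])

# Invented data: five schools, one background index, two years of mean scores.
ses = [1, 2, 3, 4, 5]
year_1 = [42, 44, 52, 48, 50]
year_2 = [41, 45, 51, 47, 51]
print(consistent_overachievers(ses, [year_1, year_2], cutoff=3.0))
```

Requiring the residual to be large in every year is what separates a
consistently unusual school from one that was merely lucky once.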

The findings have been reported in detail elsewhere. In summary,
evidence does exist that some schools are consistently outstanding. When
such schools were found, they composed between 2 and 9 percent of the
sample and were from 0.4 to 0.6 interstudent standard deviations above
the achievement level expected from their nonschool factors. This
increase corresponds roughly to these schools moving their students
consistently from the 50th percentile to the 65th or 70th; on some tests,
this is almost a full grade level better than expectation. However, no
matter how simple the control variables and even assuming that all
residual variation represented the effects of school policies, no school
in any data set was consistently able to raise its students' scores more
than about 0.8 interstudent standard deviations.
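
The translation from standard deviations to percentiles can be checked
with a short calculation. The sketch below assumes, purely for
illustration, that student scores are normally distributed; the function
name and its parameters are hypothetical.

```python
from statistics import NormalDist

def percentile_after_shift(shift_in_sd, start_percentile=50.0):
    """Percentile reached by a student who starts at `start_percentile`
    and gains `shift_in_sd` standard deviations (normality assumed)."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_start = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(z_start + shift_in_sd)

for shift in (0.4, 0.6, 0.8):
    print(shift, round(percentile_after_shift(shift), 1))
```

Under the normality assumption, gains of 0.4 and 0.6 standard deviations
carry a median student to about the 66th and 73rd percentiles, in line
with the rough figures above, and the 0.8 ceiling corresponds to about
the 79th percentile.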

Are these increases important? The study discovered schools that
were statistically "unusual," but whether they were unusually effective
is a question transcending mathematics. It depends on what one counts
as important. Can the increases be attributed to school policies? This
question deserves further research, preferably field studies; but it was
interesting to note that, for the Michigan case, the unusually effective
schools had significantly better-paid and more experienced teachers, and
smaller classes, compared with the average school.

Klitgaard and Hall (forthcoming).
The important lesson, I think, is that for both policy and research
purposes, one must not rely solely on averages over all schools.
Exceptions to the rule may be more important.

Looking Beyond the School Mean

The research just described looked only at school mean scores as
the measure of success. But even if achievement scores are a useful
indicator of some aspects of a student's cognitive growth, is the school
mean score the right statistic to use for gauging the school's success?
For the remainder of this paper, I would like to consider the
methodological and statistical problems of deciding which statistics to
use for evaluating schools, as well as to offer the results of several
investigations of the empirical behavior of some other achievement
score statistics besides the mean.

A measure may be useful for assessing an individual's welfare, yet
the mean of that measure over a group of individuals may be quite
unsatisfactory as an evaluator of the group's welfare. To show how this
applies to a more familiar case than achievement scores, consider the way
one evaluates income distributions. Suppose a person's economic assets
form a satisfactory measure of his welfare, either because there are no
other objectives than economic ones, or because a uniform metric of
willingness to pay can translate other types of objectives into an
economic measure (under stringent conditions that can be considered met),
or because we are concerned for the moment with his economic welfare and
our appreciation of that is independent of other dimensions of welfare.
Suppose income is the metric for individuals, and the desire is to
evaluate the welfare of a group--say, a country. What statistics are
appropriate? Most people would maintain that the national average income
would not be the only statistic of interest. To be sure, per capita
income is widely used to rank nations' economic development and to
indicate secular trends. But no description of a nation's economic
welfare would be complete without some measure of the distribution of
income--its dispersion among rich and poor.

Other statistics of income dispersion might be of importance in
evaluating a nation's economic situation. The relative wealth of
particular groups--racial minorities, sexes, ages, and so on--would not
be captured by measures of inequality for the whole society. Yet those
groups might be the targets of many national economic programs, the
success of which could not be gauged using the national average or some
index of national income distribution.

Many assessments of economic well-being also concern themselves
with poverty, usually defined using a threshold below which a citizen 
is called poor. Generally, the mean and the dispersion alone do not 
reflect this concern: The statistic of interest is the proportion of 
the population that falls below the poverty line, whether the line is 
defined absolutely or relatively. Economic policies that combat pover- 
ty would be poorly evaluated using only per capita income figures or 
changes in the Gini index. 
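
A small numerical illustration may make the point concrete. The sketch
below uses invented incomes and a hypothetical poverty line; the function
names are mine. It computes the Gini index from the mean absolute
difference between all pairs of incomes, and the poverty headcount as the
share of people below the line.

```python
def gini(incomes):
    """Gini index: 0 means perfect equality, values near 1 extreme inequality."""
    n = len(incomes)
    average = sum(incomes) / n
    total_difference = sum(abs(a - b) for a in incomes for b in incomes)
    return total_difference / (2 * n * n * average)

def headcount_ratio(incomes, poverty_line):
    """Proportion of the population with income below the poverty line."""
    return sum(x < poverty_line for x in incomes) / len(incomes)

# Two invented four-person "nations" with the same per capita income of 10.
equal_nation = [10, 10, 10, 10]
unequal_nation = [1, 1, 1, 37]

for name, incomes in (("equal", equal_nation), ("unequal", unequal_nation)):
    print(name, sum(incomes) / len(incomes), gini(incomes),
          headcount_ratio(incomes, poverty_line=5))
```

Per capita income cannot tell the two nations apart; the dispersion index
and the headcount each pick up something the average misses, and neither
of them substitutes for the other.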

Educational evaluations should be similarly informed about aspects 
of school success beyond the average score. School policies are also 
concerned about equality of outcomes, success with fast and slow 
learners, students from underprivileged backgrounds, mobility and 
educational opportunity, and certain minimum levels of attainment. 
Judging schools only on the basis of average scores overlooks all these 
objectives.

Supposing we agree to go beyond just the mean score, two questions
arise: (1) Beyond the average score along what achievement measures?
(2) Beyond to what statistics?

What measures? Deciding which form of achievement score to use
is not easy. In educational evaluation one is not just trying to assess
the well-being of a group; one also wants to evaluate the contribution
of policy-related variables of the educational systems to that well-being.
For system evaluation, one might prefer a value-added or residual measure
of achievement, not the achievement scores themselves. The reason is
straightforward: pupils bring different amounts of intellectual capital
to their learning experiences because of differing socioeconomic,
psychological, and genetic backgrounds. Schools with superior students
will tend to attain superior results, but not necessarily because of
superior schooling.



Smith, using data from the Equality of Educational Opportunity Survey,
found that only between 5.85 percent and 7.46 percent of the variation among
unadjusted school mean achievement scores is potentially due to school
effects. Cited in Jencks et al. (1972), p. 178.

Therefore, many writers have called for the use of residual
achievement scores to evaluate public education. Only by taking the
students' varying nonschool background factors into account, they
argue, can the differences between school scores be linked to the
quality of the education provided.

Residual scores also have their opponents. There are a host of
statistical problems, not least of which is choosing the appropriate
control variables. At best socioeconomic measures are proxies for
the background factors one wishes to hold constant across schools,
and the predictive power of various controls may differ from community
to community, making residual scores difficult to interpret. Some
argue that residual scores computed from school-level data are subject
to computational unreliability. Even working with individual residual
scores is subject to statistical errors of many kinds. If there is



These problems are often recognized by advocates, but usually
left unresolved; see, for example, Barro (1970), pp. 203-20[?]. Dyer
(197[?]), p. 526, concludes cheerfully:

Anyone who examines closely the method I am proposing
for assessing the educational opportunities provided
by schools will find plenty of problems in it, some
theoretical or technical and some practical. There
is no space here to discuss these problems, but I am
convinced that, possibly with some modification of
the basic model, they can be solved.

For a less sanguine view, see Cronbach and Furby (1970).

Dyer, Linn, and Patton (1969), implicitly assuming that separate
regressions used to control individual scores and school scores for
background factors were free from error, found that school-level
residuals had undesirably low correlations with aggregated
individual-level residuals for the same schools.

Residual variation could arise from other causes than differences
in school effectiveness: imperfection in measurement, misspecification
of background factors, omitted variables, poor choice of fitting
technique, incomplete data, regression toward the mean, and the combined
random fluctuations involved in all the regressor variables.



multicollinearity between school variables and nonschool background
factors, further uncertainty is introduced into the estimation of
school effects.

A nonstatistical, normative problem also attends the use of
residual scores. Evaluating with residual scores implies that the
regression line (relating background factors to achievement) is accepted
as the normative baseline from which to judge policy. To some educators,
the fact that the regression line indicates differences in achievement
across economic classes, geographical areas, and racial groups is part
of the problem and is itself an indicator of poor performance by the
educational system. Some educators have maintained that using residual
scores endorses existing inequalities as the proper frame of reference
for evaluation.

The choice of measures may depend on the choice of problems one
wishes to analyze. To evaluate cost-benefit aspects of education--to
compare the educational dollar's productivity with a dollar for defense,
housing, or tax refunds--one may prefer an absolute achievement measure.
However, for cost-effectiveness questions--to compare one school or
educational practice with another--a residual measure may be better.

There may be no need to be exclusive. Both measures are useful,
and both convey different kinds of information about a school's
performance. The wisest strategy, then, might be to use both unadjusted
achievement data and achievement residuals.

The mean is a useful summary statistic of a school's performance
under certain circumstances. But using only the mean for evaluation
both throws away information and makes assumptions that are probably
untenable. Using the mean for evaluation implies:

(a) An increase in an achievement score of a given magnitude is
valued equivalently, no matter where on the achievement scale
it occurs. (A gain from 25 to 30 is just the same as a gain

Given multicollinearity, the significance of each affected variable
will be difficult to interpret. Also, if the amount of multicollinearity
varies from regression to regression, not only will significance tests
be difficult, but techniques for partitioning shared variance will give
different answers. See Mayeske et al. (1969) and Creager (1971).



from 65 to 70, for example.) But the assumption is false if
we care particularly about the attainment of certain basic
skills, or if high scores are very desirable. Where educational
policy does not equally value equal-sized gains on a
standardized achievement test, the mean will not accurately
reflect educational objectives.

(b) All students are valued equally (since the arithmetic mean 
adds all students' scores in an unweighted fashion, dividing by 
the total number of students). But educational policy may 
attach greater weight to academic gains among certain students, 
perhaps to overcome past disadvantages or to increase the pro- 
portion in certain academic specialties. Insofar as a policy 
is directed at certain types of students, the mean school
score will not be adequate for evaluation.

(c) Student i's score is independent of student j's (the mean
merely sums scores, without adjusting individual scores
depending on the scores of others). This assumption may be
false for two reasons. First, one may care about the
distribution of scores across students: the equality of outcomes,
the amount of mobility, the riskiness of educational outcomes, 
the tails of the distribution of scores. The mean does not 
communicate the distribution, just its central tendency; the 
analogue to income distribution is obvious. Second, if edu- 
cation acts as a screening device or filter for later education 
or for the job market, scores i and j cannot be treated as if 
they were independent. 
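
A toy comparison shows how much the mean can hide. The scores and the
minimum level of 40 below are invented for illustration, and the function
name is hypothetical; the two schools are constructed to have identical
means.

```python
from statistics import mean, pstdev

def school_summary(scores, minimum):
    """A few statistics of a school's score distribution besides the mean."""
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),  # standard deviation within the school
        "share_below_minimum": sum(s < minimum for s in scores) / len(scores),
        "lowest": min(scores),
        "highest": max(scores),
    }

# Invented scores: both schools average exactly 50.
school_x = [48, 49, 50, 51, 52]
school_y = [20, 35, 50, 65, 80]

for name, scores in (("X", school_x), ("Y", school_y)):
    print(name, school_summary(scores, minimum=40))
```

Judged by the mean alone the schools are indistinguishable, yet in school
Y two students in five fall below the minimum while in school X none do,
and the spread of outcomes differs by an order of magnitude.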

Specifying Objective Functions: The Theory Versus Educational Realities 

Which additional statistics should be used in evaluation? This
question asks for a specification of the "objective function" that schools
should have for achievement scores. An objective function is the formal
link between objectives and evaluative measures. The idea behind an
objective function is to assign a numerical value (utility) to every
(relevant) state of the world; the decision problem is to maximize that
function subject to budget and operational constraints. With such a function



a school or program can be evaluated merely by examining its utility score
and the costs of attaining that score.

To construct an objective function for achievement scores, three
questions require answers:

(a) How does one evaluate one achievement score compared with
another (or one residual score compared with another)? We
may tautologically define some objective function U_A = f(A),
where A signifies the achievement score, or some function
U_R = g(R), where R signifies the residual score, but what
do the functions f and g actually look like?

(b) How could U_A and U_R be combined into a single, composite
objective function U_T for each student?

(c) If one is evaluating schools and not students, how could
the U_{Ti} be combined for each student i into a school index?

Question (a). How does one compare scores of 35, 40 and 45?
We know that 35 is five points lower than 40, and 40 five points lower
than 45. But the units here are derived through some standardization
process used by the testers, norming scores to some population of
students. There is no necessary reason why this scale should correspond
to one's evaluation of those scores. Does one equally value a five-point
increase whether it is from 35 to 40 or from 40 to 45 (or from
60 to 65)? To answer this question a utility function for an individual's
score is required.

Theoretically, the evaluator could construct this utility function
by presenting the decisionmaker with choices between lotteries on scores.
For instance, is it better for a student to have a score of 50 for sure
or a 50-50 lottery on scores of [illegible] and 75? If you were
indifferent, your utility function for the student's achievement could be
suspected [remainder of sentence illegible]. In the well-known von
Neumann-Morgenstern theory, a series of lottery questions could
ascertain the utility function of a rational decisionmaker.

Von Neumann and Morgenstern ([year illegible]). See also Friedman and
Savage ([year illegible]). A lucid, elementary exposition is found in
Raiffa (1968), Ch. [illegible].

Roche (1971) had local educational administrators make explicit their
utility functions for different kinds and levels of student achievement
It is difficult to predict what utility function for achievement
scores would be specified. Decisionmakers might well disagree. One
answer--though in my opinion unlikely--is that in fact a five-point
achievement score increase would be weighted the same whether it were
from 35 to 40 or 60 to 65 or anywhere else. In such a case, U_A would
be some linear function of the score, as in Fig. 1a.

Another observer might consider increases in low scores more valuable
than gains in scores that are already high. If questioned in detail about
his preferences for a student's scores, this observer might respond with
a U_A curve like the one in Fig. 1b.

If one valued achievement gains on both the low and high ends more
than those in the middle--perhaps because of an emphasis on slow
learners and the gifted--a cubic utility function like Fig. 1c might be
the appropriate representation.

Suppose one's educational objective were predominantly to ensure
that the student achieved a score above some minimum level k--perhaps
some threshold of needed cognitive skills. Achievement increases
beyond k are relatively unimportant. Then Fig. 1d might be the right
utility function to use for evaluation.

Clearly the shape of U_A might be many things besides linear.
Different policymakers might choose different functions; different
programs might want to weight achievement gains differently; and
utility functions might vary for different kinds of students. Similar
remarks apply for U_R: a priori it seems unlikely that g(R) should be
linear, and no other shape recommends itself as the obvious alternative.
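
The four shapes can be made concrete with a short sketch. The functional
forms below are my own stand-ins, chosen only to reproduce the
qualitative shapes described for Figs. 1a through 1d on a hypothetical
0-to-100 score scale; the square root, the particular cubic, the step
function, and the minimum level k = 40 are all assumptions rather than
forms taken from the figures.

```python
import math

def linear(score):
    """Fig. 1a-like shape: every point gained is worth the same."""
    return score / 100.0

def concave(score):
    """Fig. 1b-like shape: gains at low scores count for more."""
    return math.sqrt(score / 100.0)

def cubic(score):
    """Fig. 1c-like shape: gains at either end count more than in the middle."""
    return 0.5 + 4.0 * (score / 100.0 - 0.5) ** 3

def threshold(score, k=40.0):
    """Fig. 1d-like shape: what matters is reaching the minimum level k."""
    return 1.0 if score >= k else 0.0

def gain(utility, before, after):
    """Utility value of moving one student from `before` to `after`."""
    return utility(after) - utility(before)

# Value of the same five-point gain at two places on the scale.
for name, u in (("linear", linear), ("concave", concave),
                ("cubic", cubic), ("threshold", threshold)):
    print(name, round(gain(u, 35, 40), 4), round(gain(u, 60, 65), 4))
```

Only the linear shape values the two five-point gains identically, which
is the valuation that evaluating by the mean takes for granted.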

Question (b). Suppose we have elicited U_A and U_R. How can we
combine them into some overall utility function U_T? Theoretically, to
answer this question one first assesses the interdependence of the two
functions. Does our evaluation of U_A for student i depend on his
residual score? That is, is the choice among lotteries on achievement
scores any function of the student's residual score, or vice versa?
If we hold the residual score fixed at some level R_0, do our conditional
(probabilistic) preferences for the unadjusted score A depend on what
fixed value is chosen, and vice versa? If not, then the composite
utility function has an additive representation:

    U_T = U_A + U_R.

Raiffa (1969).



If our preferences for achievement scores are dependent on the student's
residual score or vice versa, then U_T must be estimated in a more
complicated way, by asking lottery questions among many possible
achievement and residual score combinations.

Question (c). Suppose U_{Ti} has been constructed for each student
i. How can the U_{Ti} be summed to obtain a school index of success? Once
again the answer depends on the interdependence of the components to
be combined. If U_{Tk} (the utility for student k) is held fixed at some
level (U_{Tk})_0, do our conditional (probabilistic) preferences for any
other U_{Ti} depend on what fixed level (U_{Tk})_0 is chosen? If not, and
if the question can also be answered negatively for all U_{Tk} fixed, then
U_{Ti} for all students 1, ..., n are mutually preferentially independent.
If this independence holds, then U_T(school) can be expressed as an
additive value function:

    U_T(school) = U_{T1} + ... + U_{Tn}.

In other words, if mutual preferential independence exists, evaluating 
a school merely involves evaluating each student and summing up the 
utilities over all students in the school. 

Unfortunately from the point of view of analytical simplicity,
such independence seems not to hold across students. As soon as
distributional considerations enter--when we care about equality of
outcomes, for example--then our feelings about U_{Ti} do depend on the
levels of the other students. Furthermore, if part of education's value
is as a screening or credentialing device, then each student's scores
affect the utility of his comrades' scores. Therefore, mutual
preferential independence does not seem to exist. As a result, U_T(school)



See Raiffa (1971) for details; Raiffa (1968, Ch. 9, Sec. 3) for
an outline of the complexities.

Mutual preferential independence means that the decisionmaker's
substitution rate between U_{Ti} and U_{Tj} does not depend on any of the
values of components other than i and j. See Raiffa (1971), pp. 74-75.
can be assessed only through a very complicated series of tradeoffs,
holding each U_{Ti} fixed at different levels while assessing the
remaining n-1 utilities, a theoretically possible but operationally
unpalatable task.

Using the school mean score as the evaluative statistic assumes a 
linear utility function and mutual preferential independence, neither 
of which seems true. 
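
The connection between these two assumptions and the mean can be written
out in one line. The derivation below is a sketch under simplifying
assumptions of my own: utility depends only on the unadjusted score A_i,
the constants a and b > 0 are hypothetical, and the school index is put
on a per-student basis.

```latex
% Assumption 1: linear utility for each student i (a, b hypothetical, b > 0).
U_{Ti} = a + b A_i
% Assumption 2: mutual preferential independence, hence an additive index.
U_T(\text{school}) = \sum_{i=1}^{n} U_{Ti} = n a + b \sum_{i=1}^{n} A_i
% Per student, the index is an increasing function of the school mean alone.
\frac{1}{n} U_T(\text{school}) = a + b \bar{A},
\qquad \bar{A} = \frac{1}{n} \sum_{i=1}^{n} A_i
```

Ranking schools by mean score is therefore the same as ranking them by
this index; drop either assumption and the equivalence no longer holds.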

Turning from theory to reality, two important facts about educa- 
tion must be reckoned with: 

(1) Local school districts (and, within districts, various
interested parties) are likely to have different utility functions.

(2) Practically, it will be extremely difficult to obtain an
operational specification of utility functions from educational
decisionmakers.

These two propositions have serious implications for educational eval- 
uation. Both make the methodology of utility functions less than per- 
fectly applicable. 

The first point implies that the search for a national objective
function that somehow combines local preferences is futile. Consensus
on education objectives will not be forthcoming--and perhaps rightly so.
In a decentralized educational system, local preferences possess a
certain autonomy, a certain right to be different. To evaluate all schools
by the same criteria, with the same utility function, would be an error.



Note that the current ways of using many statistical methods to 
evaluate schools assume common objective functions (and production 
functions) among schools. Insofar as schools are trying to do differ- 
ent things, regression coefficients relating certain inputs to a common 
output may be misleading; coefficients of multiple correlation may be 
looking at the wrong type of variability; good schools may merely be
the ones that are trying to do what one is trying to measure. Even if 
schools share a common objective, they will probably weight it differently 
in their tradeoffs among their other goals. 

There still may be a justification for making evaluations according
to a single objective function. Suppose, for example, that the evaluator
is the federal government. A decentralized educational system does not
preclude the existence of national-level spillover effects from schooling.
The federal government would want to affect the local production of these
effects through grants-in-aid, legal constraints, taxes, and so forth,
even if not through overt control; and the federal government could
evaluate its success at doing so with a single national-level objective
function
The second point means that, in educational evaluation, the objective
is not specified in advance. The problem, in my opinion, is not that
objective functions are theoretically impossible to get; the constraint
is instead one of feasibility. Three problems may be mentioned: cost; the
ticklish task of defining decisionmakers among the many educational
officials with interests and pretensions; and if there are multiple
decisionmakers, combining their objectives in a meaningful way. In
practice one cannot begin with tightly defined objective functions and
then deduce from them the appropriate way to use achievement measures for
evaluation.

From the systems analyst's point of view, education is the worst
of worlds. First, there are no well-specified objectives and they
probably cannot be obtained. Second, evaluations must nonetheless be made.
Third, the data are mostly restricted to achievement scores. And finally,
most existing large-scale evaluations and governmental data banks use
only mean scores. We know something about educational objectives--not
a sufficient amount to draw curves and derive combinatorial rules, but
enough to know that the present reliance on the mean is inadequate.

What should be done? The situation is somewhat analogous to the
one faced in evaluating a nation's economic welfare. Clearly the

that gave utility to the particular spillovers in question. This would,
of course, be a very limited sort of evaluation, but perhaps this is all
the federal government ought to attempt in a decentralized system.

Our systems analyst, an ideal type who nonetheless sometimes speaks
with the same voice as more reasonable people we know, might suggest
the following: "Since your decisionmakers are diverse and no mathematical
algorithm can be conveniently adduced for any one or all of them, why not
solve your 'statistics for evaluation' problem by giving the entire
distribution of scores for each school to all the decisionmakers? Let their
[illegible] what is important." Visions of policymakers trying to examine
hundreds of histograms, or having to compute residual [illegible] their
individual perceptions of the proper controls [illegible] our analyst for,
if they do, they only [illegible] want to overwhelm decisionmakers with
data, [illegible] of little use. Our goal is to provide a [illegible]
informative statistics that correspond (roughly) to [illegible] likely
educational objectives and that are [illegible] to
average income statistic is not enough; clearly, too, no social welfare
function has been derived from which the appropriate statistics for
evaluation could be deduced. But there is a notable difference. Unlike
education, national economic policy has employed statistics that
go beyond the mean: measures of income distribution, the poverty line,
and others. These statistics were not deduced from an objective
function, and there is no one set of them that commands universal assent
as the best and most efficient. But a number of useful statistics
have been proposed to measure certain ill-defined although meaningful
goals of economic policy. Rather than staying where we are in educational
evaluation, or throwing out achievement tests altogether, perhaps
we would do well to follow that example.

Statistics of Spread 

Equality is an increasingly voiced goal of education. In America
discussions of equality have traditionally centered on equality of
opportunity: that everyone have an equal chance to obtain a good
education, but not necessarily that everyone actually use that chance.
However, many recent writers, including some of a radical bent, have
emphasized equality of outcomes as a major educational aim. They
maintain that instead of evaluating some prior notion of the opportunity
schools provide--or perhaps in addition to such an investigation--the
equality of the actual results should be examined.

It is not clear that the more equal the educational outcomes, the
better; one's utility function might not be an increasing function of
the amount of equality. The central point is not that equality is
preferred indefinitely but that some measure of the equality of
outcomes that a school provides is helpful in a well-rounded evaluation
of its effectiveness.

A school's mean score alone tells nothing about its equality of
outcomes (although a comparison of school means will indicate something

Despite the common usage of terms like "equality" as if they were
to be maximized, there is almost surely some limit in everyone's mind--
although, as Kristol (1972) points out, advocates of equality and
mobility are reluctant to define optimum levels.
about equality among schools). To evaluate a school's equalizing
ability, one needs to go beyond its central tendency to some estimator of
the spread of the school distribution of achievement scores.

Figure 2 shows two hypothetical distributions of achievement scores
corresponding to schools A and B. Other things equal, an advocate of
equality of outcomes would prefer school A because of its smaller
variability, even though the school mean scores are equal.

One statistic of interest, then, is the spread of a school's uncon- 
trolled achievement scores. Other things equal, the smaller the spread, 
the greater the equality of cognitive achievement outcomes. 

Two kinds of residual scores related to the spread can also be
useful. First, suppose one is interested in comparing schools'
equalizing abilities. The different degrees of equality within schools may
stem from differences in nonschool background factors from school to
school, rather than different equalizing effects in schools. Schools
having students with more similar backgrounds can expect less variation
in achievement scores. One could regress some statistic of equality of
outcomes (say, the standard deviation of school scores) against various
background factors to compute a predicted standard deviation for each
level of the background variables. A residual score--observed standard
deviation minus predicted standard deviation--could then be obtained
for each school. The smaller this residual, the greater a school's
equalizing ability.
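
The residual just described might be computed along the following lines. This is only a sketch in Python under simplifying assumptions: a single background regressor stands in for the "various background factors," and the names and figures (ses_spread, observed_sd) are hypothetical illustrations, not the Michigan data.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single regressor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Hypothetical data, one entry per school (illustrative values only).
ses_spread = [1.0, 2.0, 3.0, 4.0, 5.0]     # heterogeneity of student backgrounds
observed_sd = [7.9, 8.6, 8.8, 9.9, 10.3]   # SD of achievement scores in the school

a, b = fit_line(ses_spread, observed_sd)
predicted_sd = [a + b * x for x in ses_spread]

# Residual = observed SD minus predicted SD. The smaller (more negative)
# the residual, the greater the school's apparent equalizing ability.
equalizing_residual = [o - p for o, p in zip(observed_sd, predicted_sd)]

# School indices ordered from most to least equalizing.
ranking = sorted(range(len(observed_sd)), key=lambda i: equalizing_residual[i])
```

With several background factors the same idea applies, with a multiple regression in place of fit_line.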

A second residual spread measure might serve as a proxy for
"educational mobility," another goal of schools. Americans have long cherished
the belief that education can be a powerful weapon for social
advancement, without students being imprisoned by their socioeconomic backgrounds.
Some recent studies, using mean achievement scores, have eroded this
faith. But is the mean the right statistic to measure the effects schools
have on mobility?



Some educators apparently believe that larger spreads indicate
superior schooling: "Every experienced teacher knows that effective
teaching will increase the variance of the group being taught, and
usually markedly" (Guba, 1967, p. 61).



For this mobility objective, the spread of achievement residuals 
may be a useful indicator. (In general, the spread of the residual 
scores will not be the same as the spread of the raw scores.) Given 
schools with equal mean residuals, the one providing greater residual 
variation is providing greater educational mobility. Its students 
have more opportunity to "succeed"-- and more to "fail"--compared with 
other schools whose students have like socioeconomic and personal
characteristics. Putting it another way, the students in a school with a
larger variation of residual scores are less likely to end up where
their backgrounds would have predicted.
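
A minimal sketch of this mobility proxy follows, in Python. It assumes a single pooled student-level regression on one hypothetical background index; the student records are invented for illustration and are constructed so that both schools have the same mean residual.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical student records: (school, background index, achievement score).
students = [
    ("A", 1.0, 42.0), ("A", 2.0, 47.0), ("A", 3.0, 52.0), ("A", 4.0, 57.0),
    ("B", 1.0, 50.0), ("B", 2.0, 40.0), ("B", 3.0, 60.0), ("B", 4.0, 48.0),
]

# Student-level regression of score on background, pooled over all schools.
x = [s[1] for s in students]
y = [s[2] for s in students]
mx, my = mean(x), mean(y)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx

# Collect each student's residual (actual minus predicted) by school attended.
residuals = defaultdict(list)
for school, background, score in students:
    residuals[school].append(score - (intercept + slope * background))

# Spread of residuals within each school: the proposed proxy for mobility.
mobility = {school: pstdev(r) for school, r in residuals.items()}
mean_residual = {school: mean(r) for school, r in residuals.items()}
```

In this invented example school B shows the larger residual spread, so its students are less likely to end up where their backgrounds would predict, for better and for worse.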

As with equality of outcomes, it is not necessarily true that the
more such "opportunity" for success and failure exists, the better. One
may prefer to have less chance of failure even at the loss of some
opportunity for success. In 1527 on the Isla de Gallo, Pizarro drew
a line with his sword in the sand and told his men that on one side lay
"untold hardships and starvation, treacherous reefs and storms, bitter
war and even death, but there also the golden land of the Incas" and
on the other "peace, but the peace of poverty." Only 13 of the hundreds
joined him on the side of possible riches. Risk preferences and
distributional considerations are important in deciding how much opportunity
for mobility we prefer. The fact that mobility may not be indefinitely
preferred does not, however, mean that the spread of residual scores
is a useless measure. It is merely a reminder that "mobility" is
two-directional, and that more of it, in education as elsewhere, may not
be unequivocally desired.



Risk preferences are important because people with higher risk 
aversion tend to prefer narrower distributions of outcomes to wider 
ones, given equal expected values. 

Distributional considerations may enter if the residuals display
heteroscedasticity. (Heteroscedasticity refers to nonconstant variance
of residuals around the regression line.) In such cases an increase in
the overall variance of a school's residuals increases the opportunities 
for students of certain backgrounds more than others; one cannot a priori 
presume that every student has the same probability of being located any- 
where on the school's distribution of residuals. Therefore, which stu- 
dents get more opportunity becomes paramount--and this brings distribu- 
tional objectives into the picture. 
There are, then, three possible measures of spread that would be
useful in educational evaluation: the spread of the unadjusted
achievement scores, indicating equality of outcomes; the difference between the
actual and expected spread of achievement scores, a proxy for the
equalizing ability of schools; and the spread of the residual scores of a
school's students, indicating the amount of educational mobility a school
provides. Which statistic should be selected to measure spread?

There are many possible statistics of dispersion and equality.
One is the variance (or its positive square root, the standard deviation).
However, the variance is very sensitive to extreme values; it is not a
robust estimator of spread. One estimator of spread that is less
vulnerable to outliers is the interquartile range (others are given in
Tukey, 1970, Vol. I, Ch. 2).
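
The difference in robustness can be seen in a small sketch (Python, with invented scores). The quartile convention used is simply the default of Python's statistics.quantiles, which is an assumption of this illustration; other conventions give slightly different interquartile ranges.

```python
from statistics import pstdev, quantiles

def iqr(scores):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(scores, n=4)
    return q3 - q1

# Hypothetical school scores, then the same school with one extreme score added.
scores = [44, 46, 48, 49, 50, 51, 52, 54, 56]
with_outlier = scores + [95]

# Ratio of each spread statistic after vs. before adding the extreme value.
sd_change = pstdev(with_outlier) / pstdev(scores)
iqr_change = iqr(with_outlier) / iqr(scores)
print(round(sd_change, 2), round(iqr_change, 2))
```

A single extreme score multiplies the standard deviation several times over while the interquartile range moves only slightly.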

Which statistic of spread to use should depend on a careful speci- 
fication of the educational objective function; but, short of this, 
what matters is that some such statistic be available. Further research 
should be devoted to selecting the best statistics of spread for educa- 
tion, although as in income distribution, optimality properties may not 
be agreed upon. With any of a number of measures of dispersion, schools
could be compared cross-sectionally and over time in a useful way; the
value of such statistics for evaluation should not be underestimated
because of some misplaced desire for cardinal precision.

How much do schools differ in the spreads of their achievement
scores? Do nonschool background factors explain differences between
the spreads of schools? Is there any evidence that some schools
consistently provide less variability of scores than others, holding
nonschool factors constant? Since spread measures of the intraschool
distribution of test scores have largely been ignored in the past, little
is known about the empirical characteristics of such measures.

The following are merely preliminary investigations into the
behavior of some standard deviation measures based on Michigan data for
fourth and seventh grades in 1969-70 and 1970-71. Since the data were

The data base is described in Brown (1972) and Klitgaard and Hall
(1973).
already aggregated at the school level, the "mobility" statistic, which
must be based on student-level regressions, could not be computed. Only
the standard deviation of unadjusted scores ("equality" statistic) and
the difference between the expected and the observed standard deviation
("equalizing ability" statistic) were examined, and these two only in
an exploratory fashion.

How should one expect the standard deviation statistic to behave?
It is the square root of the variance, and it is similarly sensitive to
extreme values in the distribution. In normal samples, the sample
variance is distributed as a multiple of a chi-square variate with N-1
degrees of freedom. With N small (say, less than 10), the chi-square
distribution is positively skewed; but by N = 20, the distribution is
close to Gaussian. The standard deviation tends to have higher
variability for smaller N; schools with fewer students tested will have a
higher proportion of high and especially low standard deviations, other
things equal.
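
A small simulation can illustrate this dependence on N. The sketch below (Python) assumes every school draws its students at random from one normal population with mean 50 and standard deviation 10; the school sizes and the number of simulated schools are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

def simulated_school_sds(n_students, n_schools, rng, sigma=10.0, mu=50.0):
    """Sample SDs for schools that all draw students from one normal population."""
    return [
        stdev([rng.gauss(mu, sigma) for _ in range(n_students)])
        for _ in range(n_schools)
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
spread_of_sds = {}
for n in (5, 20, 100):
    sds = simulated_school_sds(n, 2000, rng)
    # SD of the school SDs: how much the statistic varies by chance alone.
    spread_of_sds[n] = stdev(sds)
    print(n, round(mean(sds), 2), round(spread_of_sds[n], 2))
```

Even with identical populations, the school standard deviations scatter much more widely when few students are tested, which is why small schools populate both tails.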

In the Michigan data N (the number of students tested per grade)
varied considerably from school to school (see Table 2), making school
standard deviations not perfectly comparable; but since the average
value of N was quite large, the analysis simply used the standard
deviation without worrying about transformations. Eliminating all schools
with N < [illegible], the average school standard deviation was about 9
and the standard deviation of the standard deviations was about 1.1 (see
Table 1). The distributions of school standard deviations across
schools were negatively skewed. This fact might well be the



The achievement tests are normed to have an interstudent standard
deviation of 10 and mean 50.

The data cover reading and mathematics scores for fourth and
seventh grades in 1969-70 and 1970-71, a total of eight sets. Not
every school has both fourth and seventh grades, and not every school
reported data for each possible test/grade/year combination. The
skewness statistics were:

R 4 69-70 = -0.48    R 4 70-71 = -0.62
M 4 69-70 = -0.47    M 4 70-71 = -0.28
R 7 69-70 = -0.68    R 7 70-71 = -0.64
M 7 69-70 = -0.94    M 7 70-71 = -0.98

R 4 69-70 stands for the reading score for fourth grades in 1969-70;
the other symbols are interpreted similarly.
result of lower variances of smaller schools. It also might indicate
that some schools are trying to obtain more equality of outcomes than
others, or are better at doing so than other schools with similar goals.

How do these standard deviations compare with those expected, given
the different background factors among the students of different schools?
To find out, a series of regressions were run, fitting the school
standard deviation to a number of nonschool background factors. The best set
of regressions, although still only crude and exploratory, is given in
Table 1. Table 2 shows the means and standard deviations of the
regressor and response variables.

The proportion of variation explained by the regression results
varies rather widely, from 0.11 to 0.37. No differences seem important
between the reading and mathematics regressions, although the reading
scores display more heteroscedasticity as indicated by the greater
significance of the regressor. (This difference is most striking between
the fourth grade reading and mathematics scores.) SESO has the expected
positive sign on all regressions. %MIN is consistently negative,
indicating that greater numbers of minority students tend to go along with
the lower standard deviations, even after controlling for SES and the
achievement score. The number of students tested N has the expected
positive sign, indicating that smaller schools do tend to have smaller
variability.

The major finding of these regressions and the others that were
tried is the limited ability of background factors to predict school
standard deviations. This result, of course, contrasts markedly with the
results of regressions on school means, where most of the variation
across schools is explained by socioeconomic, racial, and regional
variables. (For example, the $R^2$ values for simple regressions on means
using the same Michigan data ranged from 0.59 to 0.78 (Klitgaard and
Hall, 1973, p. [illegible]).) One might hypothesize that the low explanatory
power of background factors indicates that school policies determine
standard deviations. But the low $R^2$ values may merely be a product of

The statistical properties of the standard deviation statistic
would lead one to expect smaller variances for schools with small N,
even if all schools had drawn their students randomly from the same
population; it also may be true that smaller schools tend to have more
homogeneous student bodies, even after controlling for SESO.



greater [illegible] or purely statistical problems. This question
awaits [illegible] investigation.

The residuals from these regressions constituted the second spread
measure discussed above--a statistic purporting to indicate the
equalizing ability of schools given their students' backgrounds. The
distributions were slightly tighter: The standard deviations (of the
standard deviations) now averaged about 1.0. Skewness was reduced,
although all eight distributions are still negatively skewed.
Outliers remained on the left tail, but a few also showed up on the right
tail now.

The extreme values on the left tail looked interesting enough to
pursue. Each histogram of schools' scores (say, for a particular grade,
test, and year) will show the effects of random variation as well as
the effect of different schools. A thick left tail does not by itself
prove that these schools with low variability are anything more than
random deviates. But if the same schools show up on the left tail
consistently over many grades, tests, and years, one might conclude that
the phenomenon is not just a statistical fluke. Do some schools
consistently record low variability, even after allowing for nonschool
background factors?

To find out, the following null hypothesis was formulated: All
variation of the difference between actual and expected standard deviations is
a result of chance and not of school effectiveness. To test this
hypothesis, some sort of "cumulative distribution" is required indicating how
well schools have done over many grades, tests, and years after
controlling for background factors. Then it would be possible to see if that



Notice especially how the mathematics scores have become less skewed.
The statistics were:

R 4 69-70 = -0.?0    R 4 70-71 = -0.26
M 4 69-70 = -0.14    M 4 70-71 = -0.01
R 7 69-70 = -0.?6    R 7 70-71 = -0.3?
M 7 69-70 = -0.16    M 7 70-71 = -0.19




distribution differed significantly from a theoretical distribution
obtained by treating all the individual distributions of residuals as
statistically independent.

As a proxy for this cumulative distribution, each school in a
given grade, test, and year distribution was assigned a score of one if
it was more than one standard error below the mean and a score of zero
otherwise. Each school's totals were added up over all distributions,
and a chi-square test was used to see whether some schools were
consistently below one standard error more than chance would predict. The
results appear in Table 3.
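
The scoring step can be sketched as follows (Python). The residuals and school labels are invented, only two of the eight distributions are shown, and the cutoff is taken as one standard deviation of the residuals within each distribution, which is an assumption of this sketch rather than a detail given above.

```python
from statistics import mean, pstdev

def below_one_sd_flags(residuals):
    """Score 1 for each school more than one SD below the mean residual, else 0."""
    values = list(residuals.values())
    cutoff = mean(values) - pstdev(values)
    return {school: int(value < cutoff) for school, value in residuals.items()}

# Hypothetical residuals (observed minus predicted school SD), keyed by
# test/grade/year distribution and then by school.
distributions = {
    "R4 69-70": {"s1": -1.8, "s2": 0.3, "s3": 0.9, "s4": 0.6},
    "M4 69-70": {"s1": -1.5, "s2": 0.8, "s3": 0.2, "s4": 0.5},
}

# Add up each school's scores over all distributions.
totals = {}
for residuals in distributions.values():
    for school, flag in below_one_sd_flags(residuals).items():
        totals[school] = totals.get(school, 0) + flag
```

A school that lands in the left tail again and again accumulates a high total; the question is whether such totals occur more often than chance would produce.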



A deviation from the assumption of perfect independence of the
various test scores was necessary to take account of the correlation
between reading and mathematics residuals in tests taken by the same
class in the same year. The tree below shows how the eight residuals
were generated:

            1969-70                 1970-71
        Grade 4   Grade 7       Grade 4   Grade 7
         R   M     R   M         R   M     R   M

Since the R-M residuals for a given year and grade are not independent,
the null hypothesis was reworded to posit that the pairs of scores are
independent.

Let $X_i$ be the number of scores in a school's reading-mathematics
pair $(R_i, M_i)$ that are more than one standard deviation below the
mean. $X_i$ has the possible values 0, 1, 2. Now compute a total score
$T_j$ for each school, where $T_j = X_1 + \cdots + X_n$ ($n$ = the number
of pairs of scores the school reported). Assuming the $X_i$ are
independent, compute null distributions for $T_j$ using the actual
probabilities of 0, 1, and 2 successes per pair. Then the actual
distribution can be compared with the null distribution using a
chi-square test.

The actual probabilities for each pair of tests are:



There is evidence in Table 3 that some schools consistently have a
greater equalizing effect on their students' achievement scores than chance
alone would predict. The schools that were consistently below average did
tend to be quite a distance below each time. For example, the ten schools
that were below one standard deviation at least five out of eight times
averaged about 1.6 $\sigma$ below each time. Since the standard errors were
about one test score point and the interstudent $\sigma$ = 10, these ten
schools were reducing the variability of their students' scores about
1/6 of the interstudent variation compared with the average school.
On the fourth grade Iowa reading test, this would imply tightening the
standard deviation of outcomes about 20-25 percent of a grade
equivalent.
Pair (grade, year)    P(X=0)    P(X=1)    P(X=2)

4 69-70               0.802     0.149     0.049
7 69-70               0.823     0.124     0.053
4 70-71               0.804     0.139     0.057
7 70-71               0.845     0.102     0.052



If the school reported eight scores, it had eight chances to be below
one standard deviation less than the mean; the null hypothesis is
computed for four pairs of tests. If a school only had six chances, then
the test is computed from three pairs; if four chances, two pairs. The
chances only occurred in reading-mathematics pairs (any school that
reported a reading score for a given grade and year also reported a
mathematics score that grade and year). For simplicity in calculation,
I assumed a common probability distribution P(X=0) = 0.82, P(X=1) =
0.13, P(X=2) = 0.05 for all pairs and assumed it did not matter which
particular pairs happened to make up a school's set of chances.
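
Under these assumptions the null distribution of a school's total is the repeated convolution of the common pair distribution with itself. The sketch below (Python) uses the probabilities 0.82, 0.13, and 0.05 quoted above; the function names are my own, and the chi-square helper omits the pooling of small cells.

```python
def convolve(p, q):
    """Distribution of the sum of two independent counts with pmfs p and q."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def null_distribution(n_pairs, pair_pmf=(0.82, 0.13, 0.05)):
    """P(T = t) for t = 0..2*n_pairs when the pair scores are independent."""
    dist = [1.0]
    for _ in range(n_pairs):
        dist = convolve(dist, pair_pmf)
    return dist

def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic for matching lists of cell counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# A school reporting all eight scores has four reading-mathematics pairs.
null_four_pairs = null_distribution(4)
for total, prob in enumerate(null_four_pairs):
    print(total, round(prob, 4))
```

Multiplying these probabilities by the number of schools with four pairs gives the expected counts to set against the observed counts of schools at each total.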

For the chi-square approximation to be accurate in contingency
tables with more than one degree of freedom, cells with small
expectations must be pooled. I followed a pooling rule proposed by Yarnold:

If the number of classes s is three or more, and if
r denotes the number of expectations less than five,
then the minimum expectation may be as small as 5r/s.

Lindquist and Hieronymus (19[illegible]). To give another intuitive idea
of what this reduction in variability means, a 1/6 reduction in the
standard deviation of most IQ tests would be [illegible] points.
It must be reemphasized that these results are only explorations.
They have barely touched the surface of the important questions
concerning standard deviation and other spread measures in education. How do
different measures of spread behave? How important is the variability
involved? How does spread relate to school and background
characteristics? Perhaps this beginning can whet some appetites and suggest some
directions for further study.

Statistics of Distortion

In recent years especially, educational policy has laid heavy
stress on special programs for disadvantaged and gifted students.
Spurred by the conviction that curricula and methods designed for the
average pupil do not teach slow and fast learners efficiently,
reformers have created programs for special students at an unprecedented
rate. Evidently, many educators base their judgments of school quality
partly on the number and sophistication of programs for different kinds
of students. If educational policy is significantly directed at slow
or fast students, a school's average achievement scores may be a
misleading measure of its success.

Take the case of uncontrolled achievement scores. Suppose very
low scores are very undesirable, very high ones extremely nice, and
those around the middle more or less the same. Low achievers might be
harmful to society to a far greater extent than the linear weighting of
their achievement scores would indicate, while high achievers might be
deemed extremely valuable. In this case, the utility function might
look like the cubic function in Fig. 1c. We may be willing to let
those in the middle achievement range drop a little if we can thereby
move both tails of the distribution of scores to the right. For example,
in Fig. 3 we may prefer school A to B, and either A or B to school C,
despite equal means and variances. Distribution A has more students
below the mean than B, but most are in the range where it does not
matter too much; meanwhile, A's lower tail is smaller and its upper tail
broader.
One proxy for such preferences might be the skewness of the
distribution, defined as

    $E(X - \mu)^3 / \sigma^3$

Positive (negative) skewness indicates that for any specified mean and
variance, the mode is likely to be smaller (larger) than the mean, the
left tail "unusually" short (long), and the right tail "unusually" long
(short). Increasing the positive skewness of a school's distribution
of scores trades off losses around the middle of the distribution for
gains in scores on both tails. Other things equal, much of educational
policy probably favors positive skewness.
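
To make the statistic concrete, here is a brief sketch in Python. It takes skewness to be the third central moment divided by the cube of the standard deviation, which is the usual convention and an assumption of this sketch; the two schools' scores are invented mirror images of one another, so they share a mean and variance and differ only in skewness.

```python
from statistics import mean, pstdev

def skewness(scores):
    """Third central moment divided by the cube of the standard deviation."""
    m, s = mean(scores), pstdev(scores)
    return sum((x - m) ** 3 for x in scores) / (len(scores) * s ** 3)

# Hypothetical schools with equal means and variances (B mirrors A about 50).
school_a = [46, 47, 47, 48, 48, 49, 50, 65]   # short left tail, long right tail
school_b = [35, 50, 51, 52, 52, 53, 53, 54]   # long left tail, short right tail

for name, scores in (("A", school_a), ("B", school_b)):
    print(name, mean(scores), round(pstdev(scores), 2), round(skewness(scores), 2))
```

School A comes out positively skewed and school B negatively skewed by the same amount, although the mean and the standard deviation cannot tell them apart.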

Similar remarks apply to the skewness of the school's distribution
of residuals. Fit individual student scores against their nonschool
background variables; compute individual residuals for each student;
then aggregate those residuals by school and compute the school's
skewness statistic for the distribution of residuals. Suppose that we care
more about underachievers and overachievers (no matter what the score
their background factors would predict). If we wish to avoid large
underachievers and produce large overachievers, and if we do not care
much about performances relatively near to expectation, then, other
things equal, the skewness of the distribution of residuals may validly
order schools according to our preferences.

Because the skewness statistic is a nonlinear functional, strictly
speaking there is no von Neumann-Morgenstern utility function
corresponding to its maximization. However, despite this rather ungainly
feature, the statistic has a history of use in economic
studies to measure exactly the phenomena relevant here: [illegible]
on large possible payoffs and great displeasure at large [illegible]
([illegible]; Arditti, 1967; Fisher and Hall, [illegible]).
Using regressions that also control for mean and variance, [illegible]
distortion of [illegible] school [illegible] an appropriate additional
measure [illegible]
Once again, the precise mathematical definition of the statistic
of distortion to be included is not of prime importance, nor would one
prefer positive skewness indefinitely. What matters is that some
indicator of distortion be available as an evaluative tool. Other things
equal, the more positively skewed the distribution of raw scores
within a school, the better a school is doing with its slow and fast
learners, although at the expense of its average students. And for
individual residuals, with other things equal, the more positively
skewed the distribution within a school, the better a school is doing
with its under- and overachievers, although at the expense of students
who perform at about the level predicted by their socioeconomic
backgrounds.

Statistics of Proportions above Certain Thresholds 

If some minimum level of attainment is of concern, the mean school
score can easily mislead. A simple and useful measure is available:
the proportion of students who score above the level in question.

A number of writers imply that certain thresholds of achievement
are of the utmost concern. High schools are sometimes judged by the
proportion of their graduates that can read at the ninth-grade level
or that go on to college, to name two quite different thresholds. In
performance contracting experiments, fees often depend on the number
of students performing at or above their grade levels. For such



There are problems with the skewness statistic. It is extremely
sensitive to outlying values--more than the variance or the mean--and a
more robust estimator might be called for. Another problem concerns
the fact that one's preferences for skewness cannot be separated from
one's preferences for mean and variance. Even to find a function that
ranks distributions in the same order as maximizing the third moment
of a distribution $E(X - \mu)^3$ involves specifying the mean and variance
as well. However, with some such measure one can obtain further
information that generally goes beyond the mean (which weighs all gains
and losses the same no matter where on the distribution they fall) and
the spread (which evaluates bigger tails on either end the same). This
fact implies a lack of preferential independence among the goals
relating to mean, variance, and skewness of a school's distribution: How
much skewness one prefers has to depend on the level of the school's
mean and variance.

A lower tail threshold is implicit in the writings of Kenneth
Clark, for example. Similar sentiments may be discerned in the
writings of John Stuart Mill:
objectives, the proportion of students above a certain score is the best
indicator of success.

As with the other statistics discussed so far, the proportion above
certain thresholds has useful applications with both uncontrolled and
residual scores. The proportion of students above some absolute level
tells us one thing about a school; the proportion achieving above some
level relative to their backgrounds, quite another. Both measures
usually go beyond the information provided by means, variances, and
skewness.
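
The two uses of the threshold statistic can be sketched as follows (Python). The raw scores, the scores predicted from background, and the thresholds of 45 and zero are all invented for illustration.

```python
def proportion_above(values, threshold):
    """Share of values strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)

# Hypothetical students in one school: raw scores and the scores their
# backgrounds would predict (say, from a student-level regression).
raw_scores = [38, 44, 47, 51, 55, 62]
predicted = [40, 48, 50, 50, 52, 58]
residuals = [r - p for r, p in zip(raw_scores, predicted)]

# Proportion above an absolute level (here a score of 45).
share_above_absolute = proportion_above(raw_scores, 45)

# Proportion doing better than their backgrounds would predict.
share_above_expected = proportion_above(residuals, 0)
```

In this invented school two-thirds of the students clear the absolute level, yet only half exceed expectation, so the two proportions answer different questions.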

Some crude indications of how threshold measures behave can be
gathered from data from the Yardstick Project in Cleveland, Ohio.
Yardstick contracts its data analysis services to some 34 school districts
in Ohio and other states. Its clientele varies from year to year, as
do the clients' data requests: Some ask for analyses of lower
elementary grades and some upper, and over varying time spans. Thus the data
base is not necessarily representative nor is it useful for longitudinal
analyses. However, the Yardstick data bank stratifies school data in
interesting ways. For instance, it provides growth-per-year scores
stratified by five IQ levels and five categories of father's occupation.

For 72 schools separate regressions were run on school mean growth 
(mean score for year N minus mean score for year N-1), school mean
growth for students with IQs higher than 123, and school mean growth 
for students with IQs lower than 93. Control variables included father's 
occupation and mean school IQ, among others. 

Background factors do not predict success with slow and fast learners
nearly as well as they predict school success with average students. For
the school means, a stepwise regression yielded $R^2$ = 0.55. The other fits
were very poor. In the regression on school mean growth among its
students with IQ > 123, only the percentage of children in the school whose
"It may be asserted without scruple, that the aim of all intellectual
training for the mass of people should be to cultivate common sense; to
qualify them for forming a sound practical judgment of the circumstances
by which they are surrounded. Whatever, in the intellectual department,
can be superadded to this, is chiefly ornamental." (The Principles of
Political Economy, Book II, Chapter XII: cited in Vaizey, 1962, p. 20).



fathers were skilled workers was significant (with a negative
coefficient), and the $R^2$ was only 0.18. On the under 93 side, no variables
reached the F > 4 significance level needed to enter the regression, and
when all controls were forced into the fit, the $R^2$ rose only to 0.13.
These results suggest, but shortcomings in data did not enable me to
verify, that school variables may make more difference than background
factors in determining the achievement of exceptional children, either
because schools concentrate their efforts there or because schooling
with uniform emphasis across children affects some children more than
others.

Practical Considerations and Conclusions

To restate the problem: Large-scale educational evaluations and 
government data systems often throw away useful information. This 
problem is not severe with intensive, small-scale studies; they have 
the time and resources to do thorough data analysis. But large-scale 
surveys, proposed "accountability" systems, and government information 
banks rely almost exclusively on average scores and average effects. 

Given this situation and the continual need for policy decisions, 
there are three undesirable alternatives. First, one can forgo achieve- 
ment data altogether, relying instead on less quantitative evaluative 
criteria. Second, one can choose to remain with average scores alone. 
Third, one may insist that evaluation cannot properly take place with- 
out a complete specification of educational objective functions for 
every level of government, every type of program and target population, 
all regions, every type of student, and, indeed, for every educational 
decisionmaker. 

This paper has recommended a course of action different from all 
three. Although existing tests have shortcomings, some knowledge is 
better than none and therefore let us not abandon cognitive achieve- 
ment measures. The mean is easy to use, but more knowledge is better 
than some, so we should go beyond simple averages. And although objec- 
tive functions for evaluation are elegant, their practical application 
in education faces overwhelming obstacles. 

The measures proposed here need further research before their 
exact properties are understood. Which exact statistics and which
estimators to employ are open questions. As in the case of income
distribution, there may be legitimate debate about which statistics
are best. But also as in that case, the argument is that some such
measures are better than none.

How should these statistics be used in the near term? Crude 
measures should be employed crudely. Continuous, cardinal uses of 
the statistics proposed would probably mislead more than they would 
help. A move away from pseudo-exactness is advisable. One might
divide each percentile measure into five or so categories (say, the
highest 20 percent of schools on each measure would receive a one, the 
second 20 percent a two, and so forth). (See also Dyer, 1972.) One 
might then envision a scheme like that shown in Table 4. 
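
One way to produce the categories just described is sketched below in Python. The function name, the sample values, and the handling of ties and uneven group sizes are assumptions made for illustration; the text specifies only that the highest 20 percent of schools on a measure receive a one, the second 20 percent a two, and so forth.

```python
def quintile_categories(values):
    """Map each school's value on one statistic to a category from 1 to 5.

    Category 1 is the highest 20 percent of schools on the measure and
    category 5 the lowest 20 percent. Ties are broken by input order,
    which is an arbitrary choice made for this sketch.
    """
    n = len(values)
    # Indices of schools ordered from highest value to lowest.
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    categories = [0] * n
    for rank, school in enumerate(order):
        # rank runs from 0 to n - 1, so this yields an integer from 1 to 5.
        categories[school] = rank * 5 // n + 1
    return categories


# Hypothetical school means for ten schools.
school_means = [52.1, 47.3, 50.0, 55.4, 49.2, 51.0, 46.8, 53.7, 48.5, 50.6]
mean_categories = quintile_categories(school_means)
print(mean_categories)
```

The same function would be applied separately to each statistic, so that a school receives one coarse category per measure rather than a single combined score.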

One should resist the temptation to concoct a grand measure, some 
weighted sum of all ten suggested statistics. Weighted sums assume
mutual preferential independence, which does not hold for the 
proportion measures mathematically, and probably does not hold 
(given most reasonable objective functions) for any of the measures. 
Although complicated algorithms expressing conditional preferences are 
possible, it is best not to include these formally in any data system, 
accountability scheme, or large-scale evaluation. Let each
decisionmaker (and each citizen) be his own judge.

To propose the introduction of new measures without clear-cut
objectives flies in the face of rationalist predispositions. But new
measures, even imperfect ones, can be the first step toward educational
change. James March has suggested that most rethinking of objectives
that does take place in organizations occurs precisely in a "backward"
fashion--from changes in performance indicators to changes in goals
and operations.

Using new statistics may shift discussions between educators and
evaluators from questions of overall levels of performance to questions
of equality, mobility, special programs, and the rest. One might imagine
tables that show the tradeoffs among objectives that choices of
different policies imply. The new statistics would not only more
faithfully reflect the multiple and varied nature of educational
objectives, they might also stimulate new concerns and create new
incentives for action (or avoid some unwelcome old ones).



Looking for outliers in education and looking beyond a school's
average score are two steps away from simple-minded evaluations. 
Unfortunately there are no well-developed methodologies or canned 
programs for doing either. There is much art, and much judgment, 
in discovering exceptional performers; in addition, there are nor- 
mative questions involved in deciding which statistics to call the 
measures of school success. Computational ease and custom favor the 
use of means. Policy relevance favors going beyond them.

Fig. 1 -- Some plausible shapes for utility functions for achievement

Fig. 2 -- Schools with equal means and unequal spreads

Fig. 3 -- Schools with equal means and variances but unequal skewness

Table 2

MEANS AND STANDARD DEVIATIONS OF REGRESSOR
AND RESPONSE VARIABLES

Variable         Mean    S.D.      Variable          Mean    S.D.

R469-70  μ       50.5     4.0      SES 469-70  σ      8.8     1.4
         σ        8.9     1.2      SES 769-70  σ      8.6     1.2
         N       62.8    34.7      SES 470-71  σ      8.8     1.4
M469-70  μ       50.5     4.0      SES 770-71  σ      8.8     1.4
         σ        9.0     1.1      %MIN 469-70       10.7    23.7
         N       62.8    34.7      %MIN 769-70       10.5    22.6
R769-70  μ       50.3     3.2      %MIN 470-71       10.1    22.8
         σ        9.2     0.9      %MIN 770-71        9.4    21.1
         N      172.7   130.2
M769-70  μ       50.4     3.8
         σ        9.0     1.1
         N      172.4   129.8
R470-71  μ       50.6     3.9
         σ        8.9     1.3
         N       63.5    34.4
M470-71  μ       50.6     4.2
         σ        8.9     1.1
         N       63.4    34.3
R770-71  μ       50.6     3.3
         σ        9.2     1.0
         N      182.5   133.9
M770-71  μ       50.6     3.9
         σ        9.0     1.0
         N      182.1   133.3

Table 3

RESULTS OF CHI-SQUARE TESTS OF DIFFERENCES BETWEEN
OBSERVED AND EXPECTED DISTRIBUTIONS OF RESIDUALS



Schools Reporting ^ Tl?\e!i 


Schools 


Repoiting 


0 Vlmes 




No. 


Observed 


Kxpected 


No. ■ -1.' 


Observed 




No, -I • 


Obiif rvtfd Kxpuc t od 


0 


5J 


63 


0 


61 


7S 


0 




1 


26 


40 


1 


33 


36 


1 




2 


23 


2S 


2 


a 


20 


2 




3 


13 


8 


3 






i 


49 34 


4 






4 


A> 23 


b 


4 


21 b 


S 


•1 




S 










6 


2\ 25 


) 


6 










7 
8 


: 















Chi-square = 167.0
Degrees of freedom = 4

Chi-square = 49.0
Degrees of freedom = 3

Chi-square = 74.1
Degrees of freedom = 4

All residuals were derived from a fit of the achievement score standard deviation
against SES standard deviation, achievement score mean, percent minority enrollment, and
number of students tested. All chi-square statistics are significant beyond the 0.005 level.
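
For readers who wish to see the arithmetic behind a test of this kind, the sketch below computes the usual chi-square statistic, the sum over categories of (observed - expected)^2 / expected, and compares it with tabled critical values at the 0.005 level. The counts are hypothetical and are not taken from the table above; the function names are likewise illustrative.

```python
# Upper-tail critical values of the chi-square distribution at the 0.005
# level, as given in standard statistical tables, keyed by degrees of freedom.
CRITICAL_0005 = {1: 7.879, 2: 10.597, 3: 12.838, 4: 14.860}


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def significant_beyond_0005(statistic, df):
    """True if the statistic exceeds the tabled 0.005 critical value."""
    return statistic > CRITICAL_0005[df]


# Hypothetical counts of schools in four categories (not the paper's data).
observed = [60, 25, 10, 5]
expected = [50, 30, 15, 5]

stat = chi_square_statistic(observed, expected)
is_significant = significant_beyond_0005(stat, df=3)
print(round(stat, 2), is_significant)
```

The degrees of freedom are passed in explicitly because they depend on how categories are pooled and on how the expected distribution was obtained, neither of which this sketch tries to reproduce.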



Table 4

EXAMPLE OF THE USE OF NEW ACHIEVEMENT STATISTICS

                                                                       School Number
Educational Objective              Achievement Measure             101  102  103  104  etc.

General achievement level          Mean                              2    5    3    3
Achievement relative to student
  background                       Residual mean                     4    3    1    4
Equality of achievement            Spread (perhaps σ)                1    3    2    4
Equalizing effect of school        Actual minus expected spread      3    1    2    ?
Mobility afforded by school        Residual spread                   2    4    2    5
Effectiveness with exceptional
  children                         Distortion (perhaps skewness)     1    3    2    5
Effectiveness with over- and
  underachievers                   Residual distortion               3    5    1    3
Assuring children achievement      Proportion of students
  skills at minimum level K          (≥ K)                           2    4    2    4
Assuring children do not           Proportion of students
  underachieve below level L         (< L)                           5    1    4    1
Success with those above           Mean score of students above
  (below) background level S         (below) S                       3    3    1    5

Numbers under schools refer to the following table:

   Percentile    Category
     80-100         1
     60-80          2
     40-60          3
     20-40          4
      0-20          5

Percentiles are computed for each statistic.